Training the model on all available features, with HIPPIE as the training set, was found not to produce useful weightings. Re-evaluating the project's approach and applying Bayesian principles produced a usable set of weights, but the resulting classifications were still of little use. A more principled way to perform the classification is to apply the classification algorithm only to the indirect data, and to integrate the direct data by hand, updating on it as evidence according to estimated error rates.
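The "updating on it as evidence" step can be sketched as a simple Bayes-rule update. This is not the project's actual code; the error rates below (`fpr`, `fnr`) are hypothetical placeholders for whatever estimates are obtained for a given direct-evidence source:

```python
def bayes_update(prior, observed_positive, fpr=0.05, fnr=0.2):
    """Return P(interaction | observation) via Bayes' rule.

    prior: P(interaction) before seeing this piece of direct evidence
    fpr:   assumed false positive rate, P(observed + | no interaction)
    fnr:   assumed false negative rate, P(observed - | interaction)
    """
    if observed_positive:
        likelihood_true = 1.0 - fnr   # P(observed + | interaction)
        likelihood_false = fpr        # P(observed + | no interaction)
    else:
        likelihood_true = fnr         # P(observed - | interaction)
        likelihood_false = 1.0 - fpr  # P(observed - | no interaction)
    numerator = likelihood_true * prior
    return numerator / (numerator + likelihood_false * (1.0 - prior))
```

A positive direct observation raises the posterior above the prior, a negative one lowers it, and repeated applications let several independent direct-evidence sources be folded in one at a time.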
This notebook creates the training set for a classifier that will be trained only on indirect data. All available primary interaction databases are integrated to produce a list of high-confidence positive interactions; at the same time, a list of negative interactions is generated.
The databases combined to form this list of positive interactions will be:
In [3]:
cd ../../iRefIndex
In [1]:
import pickle
In [5]:
import sys
sys.path.append("opencast-bio/")
In [6]:
import ocbio.irefindex
In [7]:
f = open("human.iRefIndex.Entrez.1ofk.pickle", "rb")
irefin = pickle.load(f)
f.close()
In [8]:
import csv
In [12]:
f = open("human.iRefIndex.positive.pairs.txt", "w")
c = csv.writer(f, delimiter="\t")
# write each positive interaction pair as a tab-separated row
for pair in irefin.featuredict:
    c.writerow(list(pair))
f.close()
In [14]:
print "Number of positive interactions: {0}".format(len(irefin.featuredict))
In [15]:
print "Number of negative interactions required: {0}".format(600*len(irefin.featuredict))
This is not a feasible number of interactions to write to file. It is also far more than is required to train a classifier. We therefore only need to write enough negative interactions to build training sets of the sizes used during training; one million negative examples will be more than enough.
In [16]:
import itertools
In [25]:
from random import shuffle

# collect the unique protein identifiers appearing in any positive pair
ids = list(set(itertools.chain.from_iterable(irefin.featuredict.keys())))
shuffle(ids)
In [26]:
f = open("human.iRefIndex.negative.pairs.txt", "w")
c = csv.writer(f, delimiter="\t")
# write pairs not found in the positive set, stopping after one million candidates
for i, pair in enumerate(itertools.combinations(ids, 2)):
    if frozenset(pair) not in irefin.featuredict:
        c.writerow(pair)
    if i > 1000000:
        break
f.close()
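One caveat with enumerating `itertools.combinations` of a shuffled list is that the first million combinations only involve identifiers near the front of the shuffled order, so the negative set is not a uniform sample over all possible pairs. A hedged alternative sketch (assuming uniform random negatives are acceptable for training) draws pairs at random instead:

```python
import random

def sample_negative_pairs(ids, positives, n, seed=42):
    """Draw n distinct unordered pairs uniformly at random,
    excluding any pair present in the positive set."""
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(ids, 2))  # two distinct ids
        if pair not in positives:
            negatives.add(pair)
    return negatives
```

This touches every identifier with equal probability, at the cost of rejection sampling; with millions of possible pairs and only ~10^6 negatives needed, the rejection rate is negligible.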
In [28]:
!head human.iRefIndex.negative.pairs.txt